arXiv:1507.02189v1 [cs.LG] 8 Jul 2015 


Intersecting Faces: 

Non-negative Matrix Factorization With New Guarantees 

Rong Ge James Zou 

Microsoft Research New England Microsoft Research New England 
rongge@microsoft.com jazo@microsoft.com 


Abstract 

Non-negative matrix factorization (NMF) is a natural model of admixture and is widely used in science and 
engineering. A plethora of algorithms have been developed to tackle NMF, but due to the non-convex nature of the 
problem, there is little guarantee on how well these methods work. Recently a surge of research has focused on a 
very restricted class of NMFs, called separable NMF, where provably correct algorithms have been developed. In this 
paper, we propose the notion of subset-separable NMF, which substantially generalizes the property of separability. 
We show that subset-separability is a natural necessary condition for the factorization to be unique or to have minimum 
volume. We develop the Face-Intersect algorithm, which provably and efficiently solves subset-separable NMF 
under natural conditions, and we prove that our algorithm is robust to small noise. We explore the performance of 
Face-Intersect on simulations and discuss settings where it empirically outperforms state-of-the-art methods. Our 
work is a step towards finding provably correct algorithms that solve large classes of NMF problems. 


1 Introduction 


In many settings in science and engineering the observed data are admixtures of multiple latent sources. We would 
typically want to infer the latent sources as well as the admixture distribution given the observations. Non-negative 
matrix factorization (NMF) is a natural mathematical framework to model many admixture problems. 


In NMF we are given an observation matrix M ∈ R^{n×m}, where each row of M corresponds to a data-point in
R^m. We assume that there are r latent sources, modeled by the unobserved matrix W ∈ R^{r×m}, where each row of
W characterizes one source. Each observed data-point is a linear combination of the r sources and the combination
weights are encoded in a matrix A ∈ R^{n×r}. Moreover, in many natural settings, the sources are non-negative and
the combinations are additive. The computational problem is then to factor a given matrix M as M = AW, where
all the entries of M, A and W are non-negative. We call r the inner-dimension of the factorization, and the smallest
possible r is usually called the nonnegative rank of M. NMF was first proposed by Lee & Seung (1999), and has
been widely applied in computer vision (Lee & Seung, 2000), document clustering (Xu et al., 2003), hyperspectral
unmixing (Nascimento & Dias, 2004; Gomez et al., 2007), computational biology (Devarajan, 2009), etc. We give two
concrete examples.

Example 1. In topic modeling, M is the n-by-m word-by-document matrix, where n is the vocabulary size and m 
is the number of documents. Each column of M corresponds to one document and the entry M(i, j) is the frequency 
with which word i appears in document j. The topics are the columns of A, and A(i, k) is the probability that topic 
k uses word i. W is the topic-by-document matrix and captures how much each topic contributes to each document. 
Since all the entries of M, A and W are frequencies, they are all non-negative. Given M from a corpus of documents, 
we would like to factor M = AW and recover the relevant topics in these documents. (Note that in this example A is 
the matrix of “sources” and W is the matrix of mixing weights, so it is the transpose of what we just introduced. We 
use this notation to be consistent with previous works (Arora et al., 2012).) 

Example 2. In many bio-medical applications, we collect samples and for each sample perform multiple measurements
(e.g. expression of 10^ genes or DNA methylation at 10® positions in the genome; all the values are
non-negative). M is the sample-by-measurement matrix, where M(i, j) is the value of the j-th measurement in sample
i. Each sample, whether taken from humans or animals, is typically a composition of several cell-types that we do not
directly observe. Each row of W corresponds to one cell-type, and W(k, j) is the value of cell-type k in measurement
j. The entry A(i, k) is the fraction of sample i that consists of cell-type k. Experiments give us the matrix M, and we
would like to factor M = AW to identify the relevant cell-types and their compositions in our samples.

Despite the simplicity of its formulation, NMF is a challenging problem. First, the NMF problem may not be
identifiable, and hence we cannot hope to recover the true A and W. Moreover, even ignoring identifiability, Vavasis
(2009) showed that finding any factorization M = AW with inner-dimension r is an NP-hard problem. Arora et al.
(2012) showed that under reasonable assumptions we cannot hope to find a factorization in time (nm)^{o(r)}, and the best
algorithm known is that of Moitra (2013), which runs in time (nm)^{O(r^2)}.


Many heuristic algorithms have been developed for NMF but they do not have guarantees for when they would
actually converge to the true factorization (Lee & Seung, 2000; Lin, 2007). More recently, there has been a surge of
interest in constructing practical NMF algorithms with strong theoretical guarantees. Most of this activity (e.g.
Arora et al. (2012); Bittorf et al. (2012); Kumar et al. (2012); Gillis (2012); Gillis & Vavasis (2014), see more in
Gillis (2014)) is based on the notion of separability (Donoho & Stodden, 2003), which is a very strict condition that requires
that all the rows of W appear as rows in M. While this might hold in some document corpora, it is unlikely to be true
in other engineering and bio-medical applications.


Our Results In this paper, we develop the notion of subset separability, which is a significantly weaker and more
general condition than separability. In topic models, for example, separability states that there is a word that is unique
to each topic. Subset separability means that there is a combination of words that is unique to each topic. We show
that subset separability arises naturally as a necessary condition when the NMF is identifiable or when we are seeking
the minimal volume factorization. We characterize settings when subset-separable NMF can be solved in polynomial
time, and this includes the separable setting as a special case. We construct the Face-Intersect algorithm, which provably
and robustly solves the NMF even in the presence of adversarial noise. We use simulations to explore conditions where
our algorithm achieves more accurate inference than current state-of-the-art algorithms.


Organization We first describe the geometric interpretation of NMF (Sec. 2), which leads us to the notion of
subset-separable NMF (Sec. 3). We then develop our Face-Intersect algorithm and analyze its robustness (Sec. 4). Our main
result, Theorem 4.2, states that for subset-separable NMF, if the facets are properly filled in a way that depends on
the magnitude of the adversarial noise, then Face-Intersect is guaranteed to find a factorization that is close to the true
factorization in polynomial time. We discuss the algorithm in more detail in Sections 5 and 6, and analyze a generative
model that gives rise to properly filled facets in Section 7. Finally we present experiments to explore settings where
Face-Intersect outperforms state-of-the-art NMF algorithms (Sec. 8). Due to space constraints, all the proofs are presented
in the appendix. Throughout the paper, we give intuitions behind proofs of the main results.


2 Geometric intuition 


For a matrix M ∈ R^{n×m} we use M^i ∈ R^m to denote the i-th row of M, viewed as a column vector. Given
a factorization M = AW, without loss of generality we can assume the rows of M, A, W all sum up to 1 (this can
always be done, see Arora et al. (2012)). In this way we can view the rows of W as vertices of an unknown simplex,
and the rows of M are all in the convex hull of these vertices. The NMF is then equivalent to the following geometric
problem:


NMF, Geometric Interpretation There is an unknown W-simplex whose vertices are the rows of W ∈ R^{r×m},
W^1, ..., W^r. We observe n points M^1, ..., M^n ∈ R^m (corresponding to rows of M) that lie in the W-simplex.
The goal is to identify the vertices of the W-simplex.
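The row normalization assumed in this section can be sketched in a few lines. This is only an illustration under the assumption that every row of W and every row of M = AW has a positive sum; the function name `normalize_factorization` is hypothetical and not part of the paper.

```python
import numpy as np

def normalize_factorization(A, W):
    """Rescale M = A W so that the rows of M, A and W each sum to 1.

    Assumes (for this sketch) that all row sums of W and of A @ W are positive.
    """
    w_sums = W.sum(axis=1)            # row sums of W, shape (r,)
    W1 = W / w_sums[:, None]          # rows of W1 sum to 1
    A1 = A * w_sums[None, :]          # compensate, so A1 @ W1 == A @ W
    m_sums = A1.sum(axis=1)           # these equal the row sums of M
    A2 = A1 / m_sums[:, None]         # rows of A2 sum to 1
    M2 = A2 @ W1                      # row-normalized M
    return M2, A2, W1
```

After this rescaling each row of M is a convex combination of the rows of W, which is what allows the simplex picture.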

When clear from context, we also refer to the matrix W as the simplex, and the goal is to find the vertices of this
simplex. There is one setting where it is easy to identify all the vertices.

Definition 2.1 (separability). An NMF is separable if all the vertices W^j appear among the points M^i that we observe.

Separability was introduced in Donoho & Stodden (2003). When the NMF is separable, the problem simplifies as
we only need to identify which of the points M^i are vertices of the simplex. This can be done in time polynomial
in n, m and r (Arora et al., 2012). Separability is a highly restrictive condition and it takes advantage of only the
0-dimensional structure (vertices) of the simplex. In this work, we use higher dimensional structures of the simplex to
solve the NMF. We use the following standard definition of facets:


Definition 2.2 (facet). A facet S ⊂ [r] of the W-simplex is the convex hull of vertices {W^j : j ∈ S}. We call S a
filled facet if there is at least one point M^i in the interior of S (or if |S| = 1 and there is one point M^i that is equal to
that vertex; such M^i is called an anchor).


Conventions When it’s clear from context, we interchangeably represent a facet S both by the indices of its vertices
and by the convex hull of these vertices. A facet also corresponds to a unique linear subspace Q_S with dimension |S|
that is the span of {W^j : j ∈ S}. In the rest of the paper, it’s convenient to use linear algebra to quantify various
geometric ideas. We will represent a d-dimensional subspace of R^m using a matrix U ∈ R^{m×d}; the columns of
U are an arbitrary orthonormal basis for the subspace (hence the representation is not unique). We use P_U = UU^T
to denote the projection matrix to subspace U, and U^⊥ ∈ R^{m×(m−d)} to denote an arbitrary representation of the
orthogonal subspace. For two subspaces U and V of the same dimension, we define their distance to be the sine of the
principal angle between the two subspaces (this is the largest angle between vectors u, v for u ∈ U and v ∈ V). This
distance can be computed as the spectral norm ‖P_{U^⊥} V‖ (and has many equivalent formulations).
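The subspace distance above can be computed directly from orthonormal bases. The following is a minimal sketch; the helper names `orthonormal_basis` and `subspace_distance` are illustrative choices, and the inputs are assumed to have full column rank.

```python
import numpy as np

def orthonormal_basis(X):
    """Orthonormal basis (as columns) for the column span of X, via reduced QR.

    Assumes X has full column rank.
    """
    Q, _ = np.linalg.qr(X)
    return Q

def subspace_distance(U, V):
    """Spectral norm of P_{U^perp} V for orthonormal bases U, V of shape (m, d)."""
    residual = V - U @ (U.T @ V)      # (I - U U^T) V, the part of V outside span(U)
    return np.linalg.norm(residual, 2)
```

For two lines in the plane at angle θ the function returns sin θ, matching the definition in terms of the principal angle.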


3 Subset Separability 

NMF is identifiable at best up to scalings and permutations of the rows of W. Ignoring such transformations, there can
still be multiple non-negative factorizations of the same matrix M. This arises when there are different sets of r vertices
in the non-negative orthant that contain all the points M^i in their convex hull. For example, suppose M = AW and the
A matrix has all positive entries. All the points M^i are then in the interior of the W-simplex, so it is possible to perturb
the vertices of W while still maintaining all of the M^i's in its convex hull. This gives rise to a different factorization
M = ÂŴ. When the factorization is not unique, we may want to find a solution where the W-simplex has minimal
volume, in the sense that it is impossible to move a single vertex and shrink the volume while maintaining the validity
of the solution.

It’s clear that in order for W to be the minimal volume solution to the NMF, there must be some points M^i that
lie on the boundary of the W-simplex. We show that a necessary condition for W to be volume minimizing is for the
filled facets (facets of W with points in their interior) to be subset-separable. Intuitively, this means that each vertex of
W is the unique intersection point of a subset of filled facets.

Definition 3.1 (subset-separable). An NMF M = AW is subset-separable if there is a set of filled facets S_1, ..., S_k ⊂ [r]
such that ∀ j ∈ [r], there is a subset S_{j_1}, S_{j_2}, ... of these facets whose intersection is exactly {j}.

Proposition 3.1. Suppose W is a minimal volume rank r solution of the NMF M = AW. Then W is subset-separable.

It is easy to see that the factorization M = AW being subset-separable is equivalent to the property that for every
j_1 ≠ j_2 ∈ [r], there is a row i of A such that A_{i,j_1} = 0 and A_{i,j_2} ≠ 0. The previously proposed separability condition
corresponds to the special case where the filled facets S_1, ..., S_k correspond to the singleton sets {W^1}, ..., {W^r}.
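The pairwise characterization in terms of the support of A is easy to check when A is known (for example, on simulated data). The sketch below assumes A is given as a dense array; the function name `is_subset_separable` and the tolerance parameter `tol` are illustrative.

```python
import numpy as np

def is_subset_separable(A, tol=0.0):
    """Check: for every ordered pair j1 != j2 there is a row i of A
    with A[i, j1] == 0 and A[i, j2] != 0 (entries compared against tol)."""
    support = np.abs(A) > tol          # (n, r) boolean support pattern
    r = support.shape[1]
    for j1 in range(r):
        for j2 in range(r):
            if j1 == j2:
                continue
            # need a row that excludes vertex j1 but uses vertex j2
            if not np.any(~support[:, j1] & support[:, j2]):
                return False
    return True
```

A matrix A with all entries positive fails the check, which matches the earlier observation that the vertices can then be perturbed.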

Example. We illustrate the subset-separable condition in Figure 1. In this figure, the circles correspond to data
points M^i and they are colored according to the facet that they belong to. The filled facets are S_1 = {1}, S_2 =
{3}, S_3 = {1, 2} and S_4 = {2, 3}. The facet {W^1, W^3} is not filled since there are no points in its interior. The
singleton facets S_1 and S_2 are also called anchors. This NMF is subset-separable since W^2 is the unique intersection
of S_3 and S_4, but it is not separable. The figure also illustrates the corresponding A matrix, where the rows are grouped
by facets and the shaded entries denote the support of each row.

Figure 1: Illustration of the NMF geometry.

The geometry of the simplex suggests an intuitive meta-algorithm for solving subset-separable NMFs, which is
the basis of our Face-Intersect algorithm.

1. Identify the filled facets, S_1, ..., S_k, r < k < n.

2. Take intersections of the facets to recover all the rows of W (vertices of the simplex).

3. Use M and W to solve for A.
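Step 3 of the meta-algorithm can be illustrated in the noiseless case, where W has full row rank and A is determined uniquely by M = AW. The sketch below uses ordinary least squares followed by clipping; this is a simplifying assumption for illustration (with noisy data a non-negative least squares solver would be the more natural choice), and the name `solve_mixing_weights` is hypothetical.

```python
import numpy as np

def solve_mixing_weights(M, W):
    """Least-squares solve of M = A W for A, assuming W has full row rank."""
    # lstsq solves W^T X = M^T in the least-squares sense; A is then X^T.
    X, *_ = np.linalg.lstsq(W.T, M.T, rcond=None)
    A = X.T
    # Clip tiny negative values caused by round-off (a heuristic, not NNLS).
    return np.clip(A, 0.0, None)
```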


4 Robust algorithm for subset-separable NMF 


In order to carry out the meta-algorithm, the key computational challenge is to efficiently and correctly identify the
filled facets of the W-simplex. Finding filled facets is related to well-studied problems in subspace clustering (Vidal,
2010) and subspace recovery (Hardt & Moitra, 2013). In subspace clustering we are given points in k different
subspaces and the goal is to cluster the points according to which subspace they belong to. This problem is in general
NP-hard (Elhamifar & Vidal, 2009) and can only be solved under strong assumptions. Subspace recovery tries to find
a unique subspace that contains a fraction p of the points. Hardt & Moitra (2013) showed this problem is hard unless p is large
compared to the ratio of the dimensions. Techniques and algorithms from subspace clustering and recovery typically
make strong assumptions about the independence of subspaces or the generative model of the points, and cannot be
directly applied to our problem. Moreover, our filled facets have the useful property that they are on the boundary
of the convex hull of the data points, which is not considered in general subspace clustering/discovery methods. We
identified a general class of filled facets, called properly filled facets, that are computationally efficient to find.


Definition 4.1 (properly filled facets). Given an NMF M = AW, a set of facets S_1, ..., S_k ⊂ [r] of W is properly filled
if it satisfies the following properties:


1. For any facet with |S_i| > 1, the rows of A with support equal to S_i (i.e. points that lie on this facet) have an
(|S_i| − 1)-dimensional convex hull. Moreover, there is at least one row of A that is in the interior of the convex hull.

2. (General positions property.) For any subspace of dimension 1 < t < r, if it contains more than t rows of M,
then the subspace contains at least one S_i which is not a singleton facet.


Condition 1 ensures that each S_i has sufficiently many points to be non-degenerate. Condition 2 says that points
that are not in the lower dimensional facets S_1, ..., S_k are in general positions, so that no random subspace looks like
a properly filled facet. A set of properly filled facets S_1, ..., S_k may contain singleton sets corresponding to some of the
rows of W if these rows also appear as rows of M. We first state the main results and then state the Face-Intersect
algorithm.

Theorem 4.1. Suppose M = AW is subset separable by S_1, ..., S_k and these facets are properly filled. Then given M,
the Face-Intersect algorithm computes A and W in time polynomial in n, m and r (and in particular the factorization
is unique).

In many applications, we have to deal with noisy NMF M̂ = AW + noise, where (potentially correlated) noise is
added to rows of the data matrix M. Suppose every row is perturbed by a small noise ε (in ℓ2 norm); we would like
the algorithm to be robust to such additive noise. We need a generalization of properly filled facets.
Definition 4.2 ((N, H, γ) properly filled facets). Given an NMF M = AW, a set of facets S_1, ..., S_k ⊂ [r] of W is
(N, H, γ) properly filled if it satisfies the following properties:

1. For any set with |S_i| > 1, there is a row i* of A whose support is equal to S_i and which is in the convex hull of other rows of
A. There exists a convex combination M^{i*} = Σ_{i∈[n]\i*} w_i M^i such that the matrix Σ_{i∈[n]\i*} w_i M^i (M^i)^T
has rank |S_i|, and its smallest nonzero singular value is at least γ. We call this special point M^{i*} the center for
this facet.

2. For any set with |S_i| > 1, there are at least N rows in A whose support is exactly equal to S_i.

3. For any subspace Q of dimension 1 < t < r, if there are at least N rows of M in an ε-neighborhood of Q, then
there exists a non-singleton set S_i with corresponding subspace Q_i such that ‖P_{Q^⊥} Q_i‖ ≤ Hε.

Intuitively, if we represent the center point as a convex combination of other points, the only points that have a
nonzero contribution must be on the same facet as the center. Condition 1 then ensures there is a “nice” convex combination
that allows us to robustly recover the subspace corresponding to the facet even in the presence of noise. Condition
2 says that every properly filled facet contains many points, which is why they are different from other subspaces and are
the facets of the true solution. Condition 3 is a generalization of the general position property, which essentially says
“every subspace that contains many points must be close to a properly filled facet”. In Section 7 we show that under a
natural generative model, the NMF has (N, H, γ)-properly filled facets with high probability.

Being properly filled is a property of how the points are distributed on the facets of W. The geometry of the
W-simplex itself also affects the accuracy of our Face-Intersect algorithm.

Definition 4.3. A matrix W ∈ R^{r×m} (r ≤ m) is α-robust if its rows have norm bounded by 1, and its r-th singular
value is at least α.
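The two quantities in this definition can be read off from a singular value decomposition. A minimal sketch follows, with hypothetical helper names `robustness` and `is_alpha_robust`, and a small numerical slack on the row-norm bound that is our own choice.

```python
import numpy as np

def robustness(W):
    """Return (largest row norm, r-th singular value) of an r x m matrix W, r <= m."""
    row_norms = np.linalg.norm(W, axis=1)
    sigma = np.linalg.svd(W, compute_uv=False)   # sorted in decreasing order
    return row_norms.max(), sigma[W.shape[0] - 1]

def is_alpha_robust(W, alpha):
    """True if all rows have norm at most 1 and the r-th singular value is >= alpha."""
    max_norm, sigma_r = robustness(W)
    return bool(max_norm <= 1.0 + 1e-12 and sigma_r >= alpha)
```

A simplex whose vertices are nearly linearly dependent has a small r-th singular value, so it is only α-robust for small α.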

Under these assumptions we prove that Face-Intersect robustly learns the unknown simplex W. 

Theorem 4.2. Suppose M = AW is subset separable by S_1, ..., S_k and these facets are (N, H, γ) properly filled, and
the matrix W is α-robust. Then given M̂ whose rows are within distance ε to M, with ε < o(α^4 γ/(H r^3)), Algorithm
Face-Intersect finds Ŵ such that there exists a permutation π and for all i, ‖Ŵ^i − W^{π(i)}‖ ≤ O(H r^2 ε/(α^2 γ)). The
running time is polynomial in n, m and r.


Algorithm 1 Face-Intersect

Run Algorithm 3 to find subspaces that correspond to properly filled facets S_1, S_2, ..., S_k where |S_i| ≥ 2.
Run Algorithm 4 to find the intersection vertices P.
Run Algorithm 5 (similar to Algorithm 4 in Arora et al. (2013)) to find the singleton points (anchors).
Given M, W, compute A.


A vertex j ∈ [r] is an intersection vertex if there exists a subset of properly filled facets {S_{j_k} : |S_{j_k}| ≥ 2} such
that {j} = ∩_k S_{j_k}. Since the first module of Face-Intersect, Algorithm 3, only finds non-singleton facets, the intersection
vertices are all the vertices that we could find using these facets. The last module of Face-Intersect finds all the
remaining vertices of the simplex.


Our approach The main idea of our algorithm is to first find the subspaces corresponding to properly filled facets,
then take the intersections of these facets to find the intersection vertices. Finally we adapt the algorithm from Arora
et al. (2013) to find the remaining vertices that correspond to singleton sets.


Finding facets For each row of M, we try to represent it as the convex combination of other rows of M. We use
an iterative algorithm to make sure the span of points used in this convex combination is exactly the subspace
corresponding to the facet.

Removing false positives The previous step will generate subspaces that correspond to properly filled facets, but
it might also generate false positives (subspaces that do not correspond to any properly filled facets). Condition
3 in Definition 4.2 allows us to filter out these false positives as these subspaces will not contain enough nearby
points.

Finding intersection vertices We design an algorithm that systematically tries to take the intersections of subspaces
in order to find the intersection vertices. This relies on the subset-separable property and robustness
properties of the simplex. This step computes at most O(nr) subspace intersection operations.

Finding remaining vertices The remaining vertices correspond to the singleton sets. This is similar to the
separable case and we use an algorithm from Arora et al. (2013).


5 Finding properly filled facets 

In this section we show how to find properly filled facets S_i with |S_i| ≥ 2. The singleton facets (anchors) are not
considered in this section, since they will be found through a separate algorithm. We first show how to find a properly
filled facet if we know its center (Condition 1 in Definition 4.2). Then to find all the properly filled facets we enumerate
points to be the center and remove false positives.

Finding one properly filled facet Given the center point, if there is no noise then when we represent this point as
a convex combination of other points, all the points with positive weight will be on the same facet. Intuitively the span
of these points should be equal to the subspace corresponding to the facet. However there are two key challenges here:
first we need to show that when there is noise, points with large weights in the convex combination are close to the
true facet; second, it is possible that points with large weights only span a lower dimensional subspace of the facet.
Condition 1 in Definition 4.2 guarantees that there exists a nice convex combination that spans the entire subspace (and
robustly so because the smallest singular value is large compared to noise). In Algorithm 2 we iteratively improve our
convex combination and eventually converge to this nice combination.


Algorithm 2 Finding a properly filled facet

input points v^1, ..., v^n, and center point v^0 (Condition 1 in Definition 4.2).
output the proper facet containing v^0.

1: Maintain a subspace Q (initially empty)

2: Iteratively solve the following optimization program:

    max   tr(P_{Q^⊥} Σ_{i=1}^n w_i v^i (v^i)^T)
    s.t.  ∀ i ∈ [n]: w_i ≥ 0
          Σ_{i=1}^n w_i = 1
          ‖v^0 − Σ_{i=1}^n w_i v^i‖ ≤ 2ε
          diag(Q^T (Σ_{i=1}^n w_i v^i (v^i)^T) Q) ≥ γ/2.

3: Let Q be the top singular space of Σ_{i=1}^n w_i v^i (v^i)^T for singular values larger than γ/2d.

4: Repeat until the dimension of Q does not increase.


Theorem 5.1. Suppose ‖v̂^i − v^i‖ ≤ ε, v^0 is the center point of a properly filled facet S ⊂ [r] with |S| = d, and the
unknown simplex W is α-robust. When d√r ε/(αγ) ≪ 1, Algorithm 2 stops within d iterations, and the subspace Q is
within distance O(√r ε/(αγ)) to the true subspace Q_S.

The intuition of Algorithm 2 is to maintain a convex combination for the center point. We show that for any convex
combination, the top singular space associated with the combination, Q, is always close to a subspace of the true
space Q_S. The algorithm then tries to explore other directions by maximizing the projection that is outside the current
subspace Q (the objective function of the convex optimization), while maintaining that the current subspace has large
singular values (the last constraint). In the proof we show that since there is a nice solution, the algorithm will always be
able to make progress until the final solution is a nice convex combination.
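The linear-algebra pieces of this procedure (the weighted second-moment matrix, its top singular space, and the objective that measures mass outside the current subspace) can be sketched as follows. This does not solve the convex program itself; the weights are assumed to be given, and the function names and the `threshold` parameter (standing in for the singular-value cutoff) are illustrative.

```python
import numpy as np

def weighted_second_moment(points, weights):
    """sum_i w_i v^i (v^i)^T for the rows v^i of `points`."""
    return (points * weights[:, None]).T @ points

def top_singular_space(S, threshold):
    """Orthonormal basis of the singular directions of S with value above `threshold`."""
    U, sigma, _ = np.linalg.svd(S)
    return U[:, sigma > threshold]

def outside_mass(S, Q):
    """tr(P_{Q^perp} S) for Q with orthonormal columns: the part of S outside span(Q)."""
    return np.trace(S) - np.trace(Q.T @ S @ Q)
```

When the points with positive weight span the whole facet, `outside_mass` evaluated at the recovered basis is zero, which corresponds to the point at which the dimension of Q stops increasing.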


Finding all subsets Algorithm 2 can find one properly filled facet, if we have its center point (Condition 1 in
Definition 4.2). In order to find all the properly filled facets, we enumerate through rows of M and prune false
positives using Condition 3 in Definition 4.2.


Algorithm 3 Finding all proper facets

input M whose factorization is subset-separable with (N, H, γ)-properly filled facets.

1: for i = 1 to n do
2:   Let v^0 = M^i and v^1, ..., v^{n−1} be the rest of the points.
3:   Run Algorithm 2 to get a subspace Q.
4:   If dim(Q) < r, and there are at least N points that are within distance O(√r ε/(αγ)) of Q, add it to the collection of
     subspaces.
5: end for
6: Let Q be a subspace in the collection; remove Q if there is a subspace Q' with dim(Q') < dim(Q) and
   ‖P_{Q^⊥} Q'‖ ≤ O(H√r ε/(αγ)).
7: Merge all subspaces that are within distance O(H√r ε/(αγ)) of each other.


Theorem 5.2. If H√r ε/(αγ) = o(α), then the output of Algorithm 3 contains only subspaces that are ε_S = O(H√r ε/(αγ))-
close to the properly filled facets, and for every properly filled facet there is a subspace in the output that is ε_S close.
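The false-positive test in step 4 of Algorithm 3 amounts to counting how many points lie near a candidate subspace. A minimal sketch is given below; the `radius` argument stands in for the O(·) distance bound, whose constant is not specified here, and the function names are hypothetical.

```python
import numpy as np

def distances_to_subspace(points, Q):
    """Euclidean distance from each row of `points` to span(Q), Q orthonormal (m x t)."""
    residual = points - (points @ Q) @ Q.T
    return np.linalg.norm(residual, axis=1)

def keep_candidate(points, Q, r, N, radius):
    """Keep Q only if dim(Q) < r and at least N points lie within `radius` of span(Q)."""
    near = np.count_nonzero(distances_to_subspace(points, Q) <= radius)
    return bool(Q.shape[1] < r and near >= N)
```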


6 Finding intersections 

Given a subset-separable NMF with (N, H, γ)-properly filled facets, let Q_i denote the subspace associated with a
set S_i of vertices: Q_i = span(W^{S_i}). For all properly filled facets with at least two vertices, Algorithm 3 returns
noisy versions of the subspaces Q_i that are ε_S close to the true subspaces. Without loss of generality, assume the first
h facets are non-singletons. Our goal is to find all the intersection vertices {W^i : i ∈ P}. Recall that intersection
vertices are the unique intersections of subsets of S_1, ..., S_h. We can view this as a set intersection problem:

Set Intersections We are given sets S_1, S_2, ..., S_h ⊂ [r]. There is an unknown set P ⊂ [r] such that ∀ i ∈ P there
exists {S_{i_k}} and {i} = ∩_k S_{i_k}. Our goal is to find the set P.

This problem is simple if we know the subsets of vertices in each facet. However, since what we really have access to
are subspaces, it is impossible to identify the vertices unless we have a subspace of dimension 1. On the other hand,
we can perform intersection and linear-span for the subspaces, which correspond to intersection and union for the
sets. We also know the size of a set by looking at the dimension of the subspace. The main challenge here is that we
cannot afford to enumerate all the possible combinations of the sets, and also there are vertices that are not intersection
vertices and they may or may not appear in the sets we have. The idea of the algorithm is to keep vertices that we have
already found in R, and try to avoid finding the same vertices by making sure S is never a subset of R. We show that after
every inner-loop one of two cases happens: in the first case we find an element in P; in the second case S is
a set that satisfies (S \ R) ∩ P = ∅, so by adding S to R we remove some of the vertices that are not in P. Since the
size of R increases by at least 1 in every iteration until R = [r], the algorithm always ends in r iterations and finds all
the vertices in P. In practice, we implement all the set operations in Algorithm 4 using the analogous subspace operations (see
Algorithm 6 in Appendix). We prove the following:

Theorem 6.1. When W is a-robust and es < o{a^Algorithm 6 finds all the intersection vertices ofW, with 
error at most Cy = 4r^ ®es/a. 
Algorithm 4 Finding Intersection

input h sets S_1, ..., S_h.
output A set P that has all the intersection vertices.

Initialize P = ∅, R = ∅.
for i = 1 to r do
    Let S = [r]
    for j = 1 to h do
        if |S ∩ S_j| < |S| and S ∩ S_j ⊄ R then
            S = S ∩ S_j
        end if
    end for
    R = R ∪ S
    Add S to P if |S| = 1.
end for
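The set version of this procedure translates almost line by line into code. The sketch below works on explicit vertex sets (labelled 1, ..., r), which is an idealization: as discussed above, in practice only subspaces are available and the set operations are replaced by their subspace analogues. The function name is hypothetical.

```python
def find_intersection_vertices(sets, r):
    """Set version of the intersection procedure on vertex labels 1..r.

    `sets` is a list of Python sets (the non-singleton facets).
    Returns the set P of vertices found as unique intersections.
    """
    universe = set(range(1, r + 1))
    P, R = set(), set()
    for _ in range(r):
        S = set(universe)
        for Sj in sets:
            T = S & Sj
            # shrink S only if it gets strictly smaller and is not already covered by R
            if len(T) < len(S) and not T <= R:
                S = T
        R |= S
        if len(S) == 1:
            P |= S
    return P
```

On the facets {1, 2} and {2, 3} from the example in Section 3 this returns {2}; vertices 1 and 3 are not intersection vertices and are left to the anchor-finding step.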


Algorithm 5 Finding remaining vertices

input matrix M, intersection vertices W^1, ..., W^{|P|}.
output remaining vertices W^{|P|+1}, ..., W^r.

for i = |P| + 1 to r do
    Let Q = span{W^1, ..., W^{i−1}}.
    Pick the point M^j with largest ‖P_{Q^⊥} M^j‖, let W^i = M^j.
end for
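A direct noiseless rendering of this greedy step is sketched below: at each round the data point farthest from the span of the vertices found so far is taken as the next vertex. The function name and the use of a QR factorization to obtain an orthonormal basis are our own choices, and the already-found vertices are assumed to be linearly independent.

```python
import numpy as np

def find_remaining_vertices(M, W_found, r):
    """Greedily extend the vertices in W_found (rows) to r vertices using rows of M."""
    vertices = [np.asarray(w, dtype=float) for w in W_found]
    for _ in range(len(vertices), r):
        if vertices:
            # orthonormal basis of span{W^1, ..., W^{i-1}} as columns (m x k)
            basis, _ = np.linalg.qr(np.array(vertices).T)
            residual = M - (M @ basis) @ basis.T   # projection onto the complement
        else:
            residual = M
        j = int(np.argmax(np.linalg.norm(residual, axis=1)))
        vertices.append(M[j].copy())
    return np.array(vertices)
```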


Finding the remaining vertices The remaining vertices correspond to singleton sets in the subset-separable assumption.
They appear in rows of M. The situation is very similar to the separable NMF and we use an algorithm from Arora et al. (2013)
to find the remaining vertices. For completeness we describe the algorithm here. By Lemma 4.5 in Arora et al. (2013)
we directly get the following theorem:

Theorem 6.2. If vertices already found have accuracy €y such that Cy < aj20r, Algorithm^outputs the remaining 
vertices with accuracy Oicjcf) < e„. 


Running time. Face-Intersect (Algorithm 1) has 3 parts: find facets (Algorithm 3), find intersections (Algorithm 4) 
and find remaining anchors (Algorithm 5). We discuss the runtime of each part. We first do dimension reduction to 
map the n points to an r-dimensional subspace to improve the running time of later steps. The dimension reduction 
takes O(nmr) time, where n, m are the number of rows and columns of M, respectively, and r is the rank of the 
factorization. Algorithm 3’s runtime is O(nd · OPT), where d is the max dimension of properly filled facets (typically 
d is smaller than both r and m). OPT is the time to solve the convex optimization problem in Algorithm 2. OPT is essentially equivalent 
to solving an LP with n nonnegative variables and r + d constraints. Algorithm 4’s runtime is 0{kr^) where k is the 
number of properly filled facets; typically k n. Algorithm 5’s runtime is 0{nr^). The overall runtime of Face- 
Intersect is 0{mnr -f nd ■ OPT + kr^ + nr^). Calling the OPT routine is the most expensive part of the algorithm. 
Empirically, we find that the algorithm converges after roughly k calls to OPT, far fewer than nd. 


7 Generative model of NMF naturally creates properly filled facets 

To better understand the generality of our approach, we analyzed a simple generative model of subset-separable NMFs 
and showed that properly filled facets naturally arise with high probability. 

Figure 2: Reconstruction accuracy of the three NMF algorithms as a function of data noise. Standard errors are shown as
error bars.

Generative Model Given a simplex W that is α-robust and a set of facets S_1, S_2, ..., S_k that is subset separable,
let p_i be the probability associated with facet i, and let p_min = min_{i∈[k]} p_i and d = max_{i∈[k]} |S_i|. For convenience,
denote S_0 = [r] and p_0 = 1 − Σ_{i=1}^k p_i. To generate a sample, first sample facet S_i with probability p_i, and then
uniformly randomly sample a point within the convex hull of the points {W^j : j ∈ S_i}. Here we think of d as a small
constant or O((log n)/log log n) (in general d can be much smaller than r). For example, the separability assumption
implies d = 1, and it is already nontrivial when d = 2.
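Sampling from this model can be sketched as follows. Uniform sampling within the convex hull of a facet is done by drawing the mixing weights from a Dirichlet(1, ..., 1) distribution, which is uniform over the weight simplex; this is exact when the facet's vertices are affinely independent. The function name, the 0-based vertex indices and the convention of listing S_0 = all vertices as the first facet are assumptions of this sketch.

```python
import numpy as np

def sample_from_facets(W, facets, probs, n, seed=0):
    """Draw n points: pick facet i with probability probs[i], then a uniform point on it.

    `facets` is a list of lists of 0-based vertex indices; include the full
    vertex set as one entry if interior points (facet S_0) are wanted.
    Returns the weight matrix A, the points M = A W, and the chosen facet indices.
    """
    rng = np.random.default_rng(seed)
    r = W.shape[0]
    A = np.zeros((n, r))
    choices = rng.choice(len(facets), size=n, p=probs)
    for row, c in enumerate(choices):
        S = facets[c]
        # Dirichlet(1, ..., 1) weights are uniform over the simplex on S
        A[row, S] = rng.dirichlet(np.ones(len(S)))
    return A, A @ W, choices
```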

Theorem 7.1. Given n = H(max{(4d)‘^ log(d/?7), log(d/pmm??)}/Pmm) samples from the model, with high 
probability the facets Si,Sk are {prairJtl^, 200r^-®/pmi„Q;, a^/16d) properly filled. 

The proof relies on the following two lemmas. The first lemma shows that once we have enough points in a 
simplex, then there is a center point with high probability. 

Lemma 7.2. Given n — H((4d)‘^ log d/p) uniform points ...,v'^ in a standard d-dimensional simplex (with 

vertices ei, 62 , ...,ed), with probability 1 — p there exists a point Vi suchthatVi = '^j^iWjV^ (wj > = 1), 

and > l/16d. 

The next lemma shows that unless a subspace contains a properly filled facet, it cannot contain too many points in its 
neighborhood. 

Lemma 7.3. Given n = f2(d^ log(d/pmint?) /Pmin) uniform points v^,v'^,..., u" in a standard d-dimensional simplex 
(with vertices ei, 62 , cf), with probability 1 — rjfor all matrices A whose largest column norm is equal to 1, there 
are at most pmintt/'^points with ||4l?;*|| < pmm/200d. 


8 Experiments 

While our algorithm has strong theoretical guarantees, we additionally performed proof-of-concept experiments to
show that when the noise is relatively small, our algorithm can outperform state-of-the-art NMF algorithms. We
simulated data according to the generative NMF model described in Section 7. We first randomly select r non-negative
vectors in R^m as rows of the W matrix. We grouped the vertices into r groups, S_1, ..., S_r, of three elements each,
such that each vertex is the unique intersection of two groups. Each S_i then corresponds to a 2-dim facet. To generate
the A matrix, for each S_i, we randomly sampled n_1 rows of A with support S_i, where each entry is drawn i.i.d. from
Unif(0,1). An additional n_2 rows of A were sampled with full support. These correspond to points in the interior of
the simplex. We tested a range of settings with m between 5 and 100, r between 3 and 10, and n_1 and n_2 between 100
and 500. We generated the true data as M = AW and added i.i.d. Gaussian noise to each entry of M to generate the
observed data M̂.

There are many algorithms for solving NMF; most of them are either iterative algorithms that have no guarantees or algorithms that work only under the separability condition. We chose two typical algorithms: the Anchor-Words algorithm (Arora et al., 2013) for separable NMF, and Projected Gradient (Lin, 2007) for iterative algorithms. For each simulated NMF, we evaluated the output factors $\hat{A}, \hat{W}$ of these algorithms on three criteria: accuracy of the reconstructed anchors to the true anchors, $\|\hat{W} - W\|_2$; accuracy of the reconstructed data matrix to the observed data, $\|\tilde{M} - \hat{A}\hat{W}\|_2$; and accuracy of the reconstructed data to the true data, $\|M - \hat{A}\hat{W}\|_2$. In Figure 2 we show the results for the three methods under the setting $n_1 = 100$, $n_2 = 100$, $m = 10$, $r = 5$. We grouped the results by the noise level of the experiment, which is defined to be the ratio of the average magnitude of the noise vectors to the average magnitude of the data points in $\mathbb{R}^m$. Face-Intersect is substantially more accurate in reconstructing the $W$ matrix than Anchor-Words and Projected Gradient. In terms of reconstructing the $M$ and $\tilde{M}$ matrices, Face-Intersect slightly outperforms Anchor-Words ($p < 0.05$, t-test), and both were substantially more accurate than Projected Gradient. As the noise level increased, the accuracy of Face-Intersect and Anchor-Words degraded, and at a noise level of around 12.5% the accuracy of the three methods converged.

In many applications, we are more interested in accurate reconstruction of the latent $W$ than of $M$. For example, in bio-medical applications, each row of $M$ is a sample and each column is the measurement of that sample at a particular bio-marker. Each sample is typically a mixture of $r$ cell-types, and each cell-type corresponds to a row of $W$. The $A$ matrix gives the mixture weights of the cell-types in the samples. Given measurements on a set of samples, $M$, an important problem is to infer the values of the latent cell-types at each bio-marker, $W$ (Zou et al., 2014). To create a more realistic simulation of this setting, we used DNA methylation values measured at 100 markers in 5 cell-types (Monocytes, B-cells, T-cells, NK-cells and Granulocytes) as the true $W$ matrix (Zou et al., 2014). From these 5 anchors we generated 600 samples, which is a typical size for such datasets, using the same procedure as above. Both Face-Intersect and Anchor-Words substantially outperformed Projected Gradient across all three reconstruction criteria. In terms of reconstructing the biomarker matrix $W$, Face-Intersect was significantly more accurate than Anchor-Words. For reconstructing the data matrices $M$ and $\tilde{M}$, Face-Intersect was statistically more accurate than Anchor-Words when the noise is less than 8% ($p < 0.05$), though the magnitude of the difference is small.
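For concreteness, the noise level and the three evaluation criteria described above could be computed as in the following sketch. The function names are hypothetical, the matrix norm is assumed to be the Frobenius norm (the text writes $\|\cdot\|_2$ without specifying), and the rows of the estimated anchors are assumed to be already matched to the true anchors.

```python
import numpy as np

def noise_level(M_true, M_obs):
    """Average norm of the noise vectors over average norm of the data points (rows)."""
    noise = np.linalg.norm(M_obs - M_true, axis=1).mean()
    signal = np.linalg.norm(M_true, axis=1).mean()
    return noise / signal

def reconstruction_errors(W, M_true, M_obs, A_hat, W_hat):
    """Three criteria; assumes rows of W_hat are aligned with rows of W."""
    R = A_hat @ W_hat
    return {
        "anchors": np.linalg.norm(W_hat - W),      # ||W_hat - W||
        "observed": np.linalg.norm(M_obs - R),     # ||M_obs - A_hat W_hat||
        "true": np.linalg.norm(M_true - R),        # ||M_true - A_hat W_hat||
    }

# Toy usage: evaluate the ground-truth factors themselves on noisy data.
rng = np.random.default_rng(1)
A = rng.random((20, 3))
W = rng.random((3, 5))
M_true = A @ W
M_obs = M_true + 0.01 * rng.standard_normal(M_true.shape)
errs = reconstruction_errors(W, M_true, M_obs, A, W)
level = noise_level(M_true, M_obs)
```

In practice an NMF solver returns the anchors in arbitrary order, so a matching step (not shown) would be needed before computing the anchor error.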


Discussion. We have presented the notion of subset separability, which substantially generalizes separable NMFs and is a necessary condition for the factorization to be unique or to have minimal volume. This naturally led us to develop the Face-Intersect algorithm, and we showed that when the NMF is subset-separable and has properly filled facets, this algorithm provably recovers the true factorization. Moreover, it is robust to small adversarial noise. We showed that the requirements for Face-Intersect to work are satisfied by simple generative models of NMFs. The original theoretical analysis of separable NMF led to a burst of research activity, and several highly efficient NMF algorithms were inspired by the theoretical ideas. We are hopeful that the idea of subset-separability will similarly lead to practical and theoretically sound algorithms for a much larger class of NMFs. Our Face-Intersect algorithm and its analysis are a first proof of concept that this is a promising direction. In exploratory experiments, we showed that under some settings where the relative noise is low, the Face-Intersect algorithm can outperform state-of-the-art NMF solvers. An important agenda of research will be to develop more robust and scalable algorithms motivated by our subset-separability analysis.

A Subset Separability and minimal volume

In this section we prove Proposition 3.1: the subset-separability condition is necessary for a minimal volume solution.

Proof. Suppose $M = AW$ is a rank-$r$ nonnegative matrix factorization with minimal volume. If this decomposition does not satisfy the subset-separable condition, then there exist $i \neq j \in [r]$ such that for every row $t$ the two entries $A_{t,i}, A_{t,j}$ are either both zero or both nonzero. That is, the columns $A_i$ and $A_j$ have the same support. Consider a new factorization $A'W'$, where the columns of $A'$ are the same as the columns of $A$ except for columns $i, j$, and the rows of $W'$ are the same as the rows of $W$ except for row $i$.

Let $A'_i = \frac{1}{1-\epsilon} A_i$, $A'_j = A_j - \frac{\epsilon}{1-\epsilon} A_i$, and $(W')^i = (1-\epsilon) W^i + \epsilon W^j$. It is easy to verify that $A'W' = AW = M$, and $W'$ is still nonnegative for $\epsilon \in [0,1]$.

Since the supports of $A_i$ and $A_j$ are the same, there exists a positive $\epsilon$ such that $A'_j$ is still a nonnegative vector. In that case $A'W'$ is a valid nonnegative matrix factorization where only one row of $W'$ is different from $W$. By construction it is clear that the volume of $W'$ is equal to $(1-\epsilon)$ times the volume of $W$, so this contradicts the assumption that $M = AW$ is a factorization with minimal volume. □
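The construction in this proof is easy to check numerically. The sketch below is an illustration only: the sizes, the random matrices, and the helper `volume` (taken here as $\sqrt{\det(WW^\top)}$, an assumed convention for the volume of the simplex spanned by the rows) are choices made for the example.

```python
import numpy as np

def volume(X):
    """Volume proxy for the rows of X: sqrt(det(X X^T)) (assumed convention)."""
    return np.sqrt(np.linalg.det(X @ X.T))

rng = np.random.default_rng(0)
r, m, n = 3, 3, 8
W = rng.random((r, m)) + 0.1
A = rng.random((n, r)) + 0.1           # columns i and j share the same (full) support
i, j = 0, 1

# A'_j = A_j - eps/(1-eps) * A_i stays nonnegative iff eps/(1-eps) <= min_t A_tj / A_ti.
ratio = np.min(A[:, j] / A[:, i])
eps = 0.5 * ratio / (1.0 + ratio)      # half of the largest admissible eps

A2, W2 = A.copy(), W.copy()
A2[:, i] = A[:, i] / (1.0 - eps)
A2[:, j] = A[:, j] - eps / (1.0 - eps) * A[:, i]
W2[i] = (1.0 - eps) * W[i] + eps * W[j]
```

With these definitions `A2 @ W2` reproduces `A @ W`, both new factors stay nonnegative, and `volume(W2)` equals `(1 - eps) * volume(W)`, which is the strict decrease used in the contradiction.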

B Detailed analysis for finding properly filled facets 

In this section we analyze the algorithms for finding properly filled facets.

B.1 Finding one properly filled facet

We first prove Theorem 5.1. For this algorithm, it is more natural to use the following robustness condition, which is a corollary of $\alpha$-robustness.

Lemma B.1. Suppose the vertices of the unknown simplex are rows of $W \in \mathbb{R}^{r \times n}$ and $W$ is $\alpha$-robust. For any face $S$ of $W$ with corresponding subspace $Q$, there exists a unit vector $h \perp Q$ such that, for any vector $v$ in the simplex with $v^{\perp}$ its component orthogonal to $Q$,

$$h \cdot v^{\perp} \ge \frac{\alpha}{\sqrt{r}} \|v^{\perp}\|.$$

Proof. Suppose $Q$ has dimension $d$ (we know $d < r$). Let $B$ be the projection of $W$ to the orthogonal subspace of $Q$, and remove the 0-columns in $B$. The matrix $B$ is an $n \times (r-d)$ matrix whose smallest singular value is at least $\alpha$ (the smallest singular value in a projection is at least the smallest singular value of the matrix). We construct $h$ as $h = (B^{\dagger})^\top \mathbf{1} / \|(B^{\dagger})^\top \mathbf{1}\|$. By the property of $B$ we know $h \cdot B_i = 1/\|(B^{\dagger})^\top \mathbf{1}\| \ge \alpha/\sqrt{r}$.

For any vector $v$ in the simplex, its orthogonal component $v^{\perp}$ is equal to $P_{Q^{\perp}}(\sum_{i=1}^{r} w_i W^i)$, which is a nonnegative combination of columns in $B$. Therefore

$$\frac{h \cdot v^{\perp}}{\|v^{\perp}\|} = \frac{\sum_i w_i\, h \cdot B_i}{\|\sum_i w_i B_i\|} \ge \min_i \frac{h \cdot B_i}{\|B_i\|} \ge \frac{\alpha}{\sqrt{r}}$$

(here we used the fact that the $h \cdot B_i$ are all positive). □

As we explained, there are two challenges in proving Theorem 5.1: (1) the observations are noisy, and we would like to show that even with the noisy observations $\hat{v}^i$, the subspace $\hat{Q}$ is always close to a subspace of the true space $Q$; (2) the convex combination may not find the entire space $Q$, for which we show the dimension of $\hat{Q}$ will increase until it is equal to the dimension of $Q$. Throughout this section we will use $d$ to denote the dimension of the true space $Q$.

We first show that in every step of the algorithm all the vectors in $\hat{Q}$ are close to the subspace $Q$. We start by proving a general perturbation lemma for singular subspaces:

Lemma B.2. Let $\hat{F} = F + E$ where both $F$ and $\hat{F}$ are positive semidefinite, $F$ is a rank $d$ matrix with column span $Q$, and $\hat{U}$ is the top $t$ ($t \le d$) singular space of $\hat{F}$ with the $t$-th singular value $\sigma_t(\hat{F}) > \|E\|$. Then $\|P_{Q^{\perp}} \hat{U}\| \le \|E\|/\sigma_t(\hat{F})$.

Proof. Let $\hat{U}\hat{D}\hat{U}^\top$ be the truncated top $t$ SVD of $\hat{F}$, and $\tilde{U}\tilde{D}\tilde{U}^\top$ be the full SVD. We know

$$\|P_{Q^{\perp}} \tilde{U}\tilde{D}\| = \|P_{Q^{\perp}} \hat{F}\| = \|P_{Q^{\perp}} E\| \le \|E\|,$$

where the first equality is because $\tilde{U}$ is an orthonormal matrix, and the second equality is because the column span of $F$ is inside $Q$.

On the other hand, $P_{Q^{\perp}} \hat{U}\hat{D}$ is a submatrix of $P_{Q^{\perp}} \tilde{U}\tilde{D}$, so we know $\|P_{Q^{\perp}} \hat{U}\hat{D}\| \le \|P_{Q^{\perp}} \tilde{U}\tilde{D}\| \le \|E\|$. Since all the entries in $\hat{D}$ are at least $\sigma_t(\hat{F})$, this implies $\|P_{Q^{\perp}} \hat{U}\| \le \|E\|/\sigma_t(\hat{F})$. □
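The bound of Lemma B.2 can be sanity-checked on a random instance. The following sketch is an illustration under assumed sizes and an assumed perturbation model (a small positive semidefinite $E$, which keeps $\hat{F}$ positive semidefinite); it is not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 10, 4, 3

# F: rank-d PSD matrix whose column span is Q (eigenvalues on Q are at least 1).
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))
C = rng.standard_normal((d, d))
F = Q @ (C @ C.T + np.eye(d)) @ Q.T

# E: small PSD perturbation with spectral norm 1e-2, so F_hat = F + E stays PSD.
G = rng.standard_normal((n, n))
E = 1e-2 * (G @ G.T) / np.linalg.norm(G @ G.T, 2)
F_hat = F + E

# Top-t eigenvectors of F_hat (eigh returns eigenvalues in ascending order).
vals, vecs = np.linalg.eigh(F_hat)
U_hat = vecs[:, ::-1][:, :t]
sigma_t = vals[::-1][t - 1]

P_perp = np.eye(n) - Q @ Q.T             # projector onto the orthogonal complement of Q
lhs = np.linalg.norm(P_perp @ U_hat, 2)  # ||P_{Q-perp} U_hat||
rhs = np.linalg.norm(E, 2) / sigma_t     # ||E|| / sigma_t(F_hat)
```

On this instance `lhs` is bounded by `rhs`, as the lemma predicts; shrinking the scale of `E` shrinks both sides proportionally.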

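In the later proofs we usually think of $\hat{F}$ as $\sum_{i=1}^{n} w_i \hat{v}^i (\hat{v}^i)^\top$ and $F$ as $\sum_{i=1}^{n} w_i P_Q v^i (v^i)^\top P_Q$. The next lemma shows that for any feasible solution of the optimization program (even just considering the first three constraints), the matrix $\hat{F}$ is close to $F$:

Lemma B.3. For any feasible solution that satisfies the first three constraints, let $\hat{F} = \sum_{i=1}^{n} w_i \hat{v}^i (\hat{v}^i)^\top$ and $F = \sum_{i=1}^{n} w_i P_Q v^i (v^i)^\top P_Q$. We have $\|E\| = \|\hat{F} - F\| \le O(\sqrt{r}\epsilon/\alpha)$. In fact, even the nuclear norm^1 satisfies $\|\hat{F} - F\|_* \le O(\sqrt{r}\epsilon/\alpha)$.

Proof. Let $\tilde{F} = \sum_{i=1}^{n} w_i v^i (v^i)^\top$; we show $\tilde{F}$ is close to both $\hat{F}$ and $F$. Let $\delta_i = \hat{v}^i - v^i$. By assumption we know $\|\delta_i\| \le \epsilon$. Also, by assumption $\|v^i\| \le 1$ (normalization), so

$$\|\hat{F} - \tilde{F}\| \le \sum_{i=1}^{n} w_i \|\delta_i (v^i)^\top + v^i \delta_i^\top + \delta_i \delta_i^\top\| \le (2\epsilon + \epsilon^2) \sum_{i=1}^{n} w_i = O(\epsilon).$$

On the other hand, by the third constraint we know $\|\hat{v}^0 - \sum_{i=1}^{n} w_i \hat{v}^i\| \le 2\epsilon$, which implies $\|v^0 - \sum_{i=1}^{n} w_i v^i\| \le 4\epsilon$ (because $\|\hat{v}^i - v^i\| \le \epsilon$ and the $w_i$'s form a probability distribution). Using the robustness condition, let $v^{i\perp} = v^i - P_Q v^i$, then

$$4\epsilon \ge \Big\|v^0 - \sum_{i=1}^{n} w_i v^i\Big\| \ge \sum_{i=1}^{n} w_i\, h \cdot v^{i\perp} \ge \frac{\alpha}{\sqrt{r}} \sum_{i=1}^{n} w_i \|v^{i\perp}\|.$$

Therefore we know

$$\|\tilde{F} - F\| \le \sum_{i=1}^{n} w_i \|P_Q v^i (v^{i\perp})^\top + v^{i\perp} (v^i)^\top P_Q + v^{i\perp} (v^{i\perp})^\top\| \le O(\sqrt{r}\epsilon/\alpha)$$

(note that $\|v^{i\perp}\| \le 1$ by normalization).

The nuclear norm bound follows from exactly the same proof. □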

The previous two lemmas guarantee that at any time of the algorithm, the subspace $\hat{Q}$ is always close to a subspace of $Q$. In the next lemma we show that the algorithm makes progress.

Lemma B.4. If $\dim(\hat{Q}) = t < d$, then in the next iteration the dimension of $\hat{Q}$ increases by at least 1.

Proof. Since $v^0$ is a center of the facet, we know there exists a "nice" solution $w^*$ such that $v^0 = \sum_{i=1}^{n} w^*_i v^i$ and $\sum_{i=1}^{n} w^*_i (v^i)(v^i)^\top$ has $d$-th singular value at least $\gamma$.

We first show that this guaranteed good solution $w^*$ is always a feasible solution. Clearly it satisfies the first three constraints (by triangle inequality). For the last constraint, let $\hat{F}^*$ be the $\hat{F}$-matrix constructed by $w^*$ and $F^*$ be the corresponding $F$ matrix. By assumption we know $F^*$ has $\sigma_d(F^*) \ge \gamma$, so in particular for any direction $u$ in subspace $Q$, $u^\top F^* u \ge \gamma$. Since by the previous two lemmas we have $\|P_{Q^{\perp}} \hat{Q}\| \le O(\sqrt{r}\epsilon/\alpha\gamma)$, in particular every column of $\hat{Q}$ is within $O(\sqrt{r}\epsilon/\alpha\gamma)$ of its projection in $Q$, we know $\mathrm{diag}(\hat{Q}^\top F^* \hat{Q}) \ge c\gamma$ for a constant $c > 1/2$ (when $\sqrt{r}\epsilon/\alpha\gamma$ is smaller than some universal constant). Now since by Lemma B.3 $\hat{F}^*$ and $F^*$ are close (in spectral norm), we have $\mathrm{diag}(\hat{Q}^\top \hat{F}^* \hat{Q}) \ge \gamma/2$.

Since the solution $w^*$ is feasible, the optimal solution must have objective value no less than the objective value of $w^*$. By the nuclear norm bound, for any subspace $\hat{Q}$ we know

$$\mathrm{tr}(P_{\hat{Q}^{\perp}} \hat{F} P_{\hat{Q}^{\perp}}) - \mathrm{tr}(P_{\hat{Q}^{\perp}} F P_{\hat{Q}^{\perp}}) = \mathrm{tr}(P_{\hat{Q}^{\perp}} (\hat{F} - F) P_{\hat{Q}^{\perp}}) \le \|P_{\hat{Q}^{\perp}} (\hat{F} - F) P_{\hat{Q}^{\perp}}\|_* \le \|\hat{F} - F\|_* \le O(\sqrt{r}\epsilon/\alpha),$$

where we used the fact that the trace of a matrix is always bounded by its nuclear norm, and the nuclear norm of a projection is never larger than the nuclear norm of the original matrix^2.

On the other hand $\mathrm{tr}(P_{\hat{Q}^{\perp}} F^* P_{\hat{Q}^{\perp}}) \ge \mathrm{tr}(F^*) - \sum_{i=1}^{t} \sigma_i(F^*) \ge \sum_{i=t+1}^{d} \sigma_i(F^*) \ge \gamma(d - t)$. So the optimal objective value must be at least $\gamma(d - t) - O(\sqrt{r}\epsilon/\alpha)$.

Let $w$ be the optimal solution and $\hat{F}, F$ be the corresponding matrices; by the same argument we know

$$\mathrm{tr}(P_{\hat{Q}^{\perp}} F P_{\hat{Q}^{\perp}}) \ge \mathrm{tr}(P_{\hat{Q}^{\perp}} \hat{F} P_{\hat{Q}^{\perp}}) - O(\sqrt{r}\epsilon/\alpha) \ge \gamma(d - t) - O(\sqrt{r}\epsilon/\alpha).$$

However, $P_{\hat{Q}^{\perp}} F P_{\hat{Q}^{\perp}}$ is a matrix of rank at most $d$, therefore $\|P_{\hat{Q}^{\perp}} F P_{\hat{Q}^{\perp}}\| \ge \gamma/d - O(\sqrt{r}\epsilon/\alpha d)$. For the $\hat{F}$ matrix we also have $\|P_{\hat{Q}^{\perp}} \hat{F} P_{\hat{Q}^{\perp}}\| \ge \gamma/2d$ because $\|\hat{F} - F\|$ is small.

Now for the matrix $\hat{F}$, there are $t + 1$ orthogonal directions ($t$ from $\hat{Q}$, and at least one orthogonal to $\hat{Q}$) with singular value at least $\gamma/2d$, hence $\sigma_{t+1}(\hat{F}) \ge \gamma/2d$. As a result in the next step the dimension of $\hat{Q}$ increases by at least 1. □

^1 The nuclear norm $\|M\|_*$ is equal to the sum of the singular values of $M$; it is also the dual norm of the spectral norm in the sense that $\|M\|_* = \max_{\|A\| \le 1} \langle A, M \rangle$.

^2 This follows from the fact that $\|A\|_* = \max_{\|B\| \le 1} \langle A, B \rangle$ and spectral norms do not increase after projection.


Now we are ready to prove Theorem 5.1.

Proof (of Theorem 5.1). By Lemma B.4 and Lemma B.3 we know when the algorithm ends we must have $\dim(\hat{Q}) \ge \dim(Q)$. Now by the last constraint, we know $\sigma_d(\sum_{i} w_i \hat{v}^i (\hat{v}^i)^\top) \ge \gamma/2$. Combined with Lemma B.3 and Lemma B.2 this implies the final subspace is within distance $O(\sqrt{r}\epsilon/\alpha\gamma)$. □


B.2 Finding all properly filled facets 

Theorem 5.2 follows immediately from Theorem 5.1; for completeness we provide the proof here.

Proof (of Theorem 5.2). By Theorem 5.1 and by Condition 2 in Definition 4.2, when we run the algorithm of Theorem 5.1 on a correct center point, the resulting subspace will always be added to the collection. Therefore at the end of the loop, for each facet $S_i$ with at least two vertices and its corresponding subspace $Q_i$, there must be a $\hat{Q}_i$ in the collection that is $O(\sqrt{r}\epsilon/\alpha\gamma)$-close.

On the other hand, by Condition 3 in Definition 4.2 we know every subspace $\hat{Q}$ that is in the collection must satisfy $\|P_{\hat{Q}^{\perp}} Q_i\| \le O(H\sqrt{r}\epsilon/\alpha\gamma)$ for some true subspace $Q_i$. If $\hat{Q}$ has dimension larger than $Q_i$, then $\|P_{\hat{Q}^{\perp}} \hat{Q}_i\| \le \|P_{\hat{Q}^{\perp}} Q_i\| + \|P_{\hat{Q}_i} - P_{Q_i}\| \le O(H\sqrt{r}\epsilon/\alpha\gamma)$. Therefore all the false positives with higher dimension are removed. The remaining subspaces must be $O(H\sqrt{r}\epsilon/\alpha\gamma)$-close to one of the true subspaces.

By the $\alpha$-robustness condition, two subspaces corresponding to different facets must have distance at least $\alpha$, so when $H\sqrt{r}\epsilon/\alpha\gamma \le o(\alpha)$ the subspaces $\hat{Q}$ close to a true subset cannot be removed. Also, in the last step it is easy to identify the subspaces $\hat{Q}$ that are close to one true space $Q_i$; any one of those will be $\epsilon_S$-close to the true subsets. □


C Detailed Analysis for finding intersections 


In this section we first prove Theorem 6.1, then we discuss how to apply Algorithm 4 from Arora et al. (2013) to find the remaining vertices.

The main idea of the implementation is that the subspace $Z$ will always be close to the span of $\{W^i : i \in S\}$ where $S = \cap_{j \in U} S_j$. The subspace $\Gamma$ will correspond to the span of $\{W^i : i \in R\}$ where $R$ is the set of points that we have already found. If $\|P_{\Gamma^{\perp}} Z\|$ is large then it means $S$ is not a subset of $R$.

For this step we also need a particular corollary of the $\alpha$-robustness condition.

Lemma C.1. Suppose the vertices of the unknown simplex are rows of $W \in \mathbb{R}^{r \times n}$ and $W$ is $\alpha$-robust. Let $Q_1, Q_2, \ldots, Q_t$ be a set of faces that has intersection $S \subset [r]$, and let $Q_i^{\perp}$ be an arbitrary basis for the orthogonal subspace of $Q_i$. The matrix $E = [Q_1^{\perp}, Q_2^{\perp}, \ldots, Q_t^{\perp}]^\top$ has a null-space equal to $\mathrm{span}\{W^i : i \in S\}$ and $\sigma_{n-|S|}(E) \ge \alpha/\sqrt{r}$.

Proof. Clearly all the vectors $\{W^i : i \in S\}$ are in the null-space of $E$ as $W^i \in Q_j$ for all $j \in [t]$. Vectors that are orthogonal to the row span of $W$ have projection 1 in all of the $P_{Q_j^{\perp}}$'s, and they do not influence the projections within the row span of $W$. We only need to prove that within the row span of $W$, for all the directions orthogonal to $\{W^i : i \in S\}$, the matrix still has large singular values.

Let $S_j$ be the set of vertices that $Q_j$ contains. We define $S'_j$ as follows: $S'_1 = [r] \setminus S_1$, and for all $j > 1$, $S'_j = [r] \setminus ((\cup_{j' < j} S'_{j'}) \cup S_j)$. Since $S$ is the intersection of the $S_j$'s, we know $\cup_j S'_j = [r] \setminus S$. Also by construction we know the $S'_j$'s are disjoint. For each $S'_j$, let $Q'_j$ be the span of rows of $W$ with indices in $[r] \setminus S'_j$. Since $[r] \setminus S'_j$ is a superset of $S_j$, we know $Q_j$ is a subspace of $Q'_j$ and hence $P_{(Q'_j)^{\perp}} \preceq P_{Q_j^{\perp}}$. For each $j$ construct $B_j$ to be the matrix that is an (arbitrary) orthogonal basis of the orthogonal subspace of $Q'_j$ in the span of $W$. Let $B = [B_1, B_2, \ldots, B_t] \in \mathbb{R}^{n \times (r - |S|)}$. We know $E^\top E \succeq B B^\top$. Therefore we only need to show the matrix $B$ has large smallest singular value.

Now consider the product $WB$. By construction of $B$, this is a block diagonal matrix (with blocks corresponding to the $S'_j$'s). Since $W$ is $\alpha$-robust we know each block has smallest singular value at least $\alpha$. Therefore $\sigma_{\min}(WB) \ge \alpha$, and $\sigma_{\min}(B) \ge \alpha/\|W\| \ge \alpha/\sqrt{r}$. By the relationship between $E$ and $B$ we know $\sigma_{n-|S|}(E) \ge \sigma_{\min}(B) \ge \alpha/\sqrt{r}$. □

^3 This uses the variational characterization of the distance between subspaces $U$ and $V$ as $\max_{u \in U, v \in V} \sin\theta(u, v)$, where $\theta(u, v)$ is the angle between $u$ and $v$.

Algorithm 6 Finding Intersection

input: $k$ subspaces $\hat{Q}_1, \ldots, \hat{Q}_k$.
output: intersection vertices $\hat{W}^i$ that correspond to $\{W^i : i \in P\}$.

Let $\epsilon_v = 4r^{1.5}\epsilon_S/\alpha$, $\epsilon_\Gamma = 2r\epsilon_v/\alpha$.
Maintain a list of vertices $\{\hat{W}^i\}$, a matrix $Y$, and the subspace $\Gamma$ that corresponds to the left singular space of $Y$ with singular values larger than $\alpha/2$.
for $i = 1$ TO $r$ do
    Maintain a set $U \subset [k]$, $E = [\hat{Q}_j^{\perp} : j \in U]$, and let $Z$ be the space of left singular vectors of $E$ with singular values at most $r\epsilon_S$.
    Initialize $U = \emptyset$.
    for $j = 1$ TO $k$ do
        Let $E' = E + P_{\hat{Q}_j^{\perp}}$, and let $Z'$ be the space of eigenvectors of $E'$ with eigenvalues at most $r\epsilon_S$.
        if $\dim(Z') < \dim(Z)$ and $\|P_{\Gamma^{\perp}} Z'\| \ge \alpha/2$ then
            let $U = U \cup \{j\}$, replace $E$ and $Z$ by $E'$, $Z'$.
        end if
    end for
    Append $Z$ to $Y$ ($Y = [Y, Z]$), update $\Gamma$.
    If $\dim(Z) = 1$ then add the direction to the list of $\hat{W}$.
end for

This lemma allows us to take the intersections of subspaces robustly. 
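The idea of Lemma C.1 can be illustrated numerically: stack bases of the orthogonal complements and read off the intersection from the directions with (near) zero singular value. The helper names below are hypothetical, the example is noise-free, and the threshold `tol` merely plays the role that $r\epsilon_S$ plays in Algorithm 6; this is a sketch, not the algorithm itself.

```python
import numpy as np

def orth_complement(Q):
    """Orthonormal basis of the orthogonal complement of span(Q); Q is n x d with full column rank."""
    d = Q.shape[1]
    U, _, _ = np.linalg.svd(Q, full_matrices=True)
    return U[:, d:]

def intersect_subspaces(bases, tol):
    """Directions (nearly) orthogonal to every complement basis.

    Stacks the complements as rows of E and returns the right singular
    vectors of E whose singular values are at most tol.
    """
    E = np.vstack([orth_complement(Q).T for Q in bases])
    _, s, Vt = np.linalg.svd(E, full_matrices=True)
    s_full = np.zeros(Vt.shape[0])       # missing singular values are exactly zero
    s_full[: len(s)] = s
    return Vt[s_full <= tol].T

# Two faces sharing exactly one vertex (row 1 of W).
rng = np.random.default_rng(0)
W = rng.random((4, 6))
Q1 = W[[0, 1]].T                         # span{W^0, W^1}
Q2 = W[[1, 2]].T                         # span{W^1, W^2}
Z = intersect_subspaces([Q1, Q2], tol=1e-8)
cosine = abs(Z[:, 0] @ W[1]) / np.linalg.norm(W[1])
```

Here `Z` has a single column that is parallel to `W[1]`, the shared vertex. With noisy subspaces the exact zero singular value becomes a small one, which is why a positive threshold is used.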

We prove the theorem by induction. The induction hypothesis is:

Claim C.1. At the end of every outer-loop, the $\hat{W}^i$'s are $\epsilon_v = 4r^{1.5}\epsilon_S/\alpha$ close to some vertices in $W$, and $\Gamma$ is $\epsilon_\Gamma = 2r\epsilon_v/\alpha$-close to a subspace spanned by $W^R$ where $R \subset [r]$ is a subset of vertices. The set $R$ never contains any vertex in $P$ that is not already close to one of the elements in the list $\hat{W}^i$.

Clearly this hypothesis is true before the first iteration (everything was empty). Next we analyze the inner-loop of 
the algorithm. During the inner-loop the algorithm maintains the following properties: 

Lemma C.2. The set $U$ always has size at most $r - 1$, and the subspace $Z$ is always $\epsilon_v = 4r^{1.5}\epsilon_S/\alpha$-close to the subspace spanned by $\{W^i : i \in \cap_{j \in U} S_j\}$.

Proof. After the first element is added to $U$, the dimension of $Z$ is equal to the dimension of some $Q_j$, which is at most $r - 1$. Every time we add an element to $U$ the dimension of $Z$ decreases by at least 1, and when $\dim(Z)$ becomes 1 the algorithm stops. So there must be at most $r - 1$ elements in $U$. By Lemma C.1 we know if the matrix consists of the true $Q_j^{\perp}$, then it has nullspace equal to the span of $\{W^i : i \in \cap_{j \in U} S_j\}$, and all the other directions have eigenvalue at least $\alpha/\sqrt{r}$. The difference between $E$ and the true matrix is at most $2r\epsilon_S$, so when $2r\epsilon_S < \alpha/4\sqrt{r}$, by matrix perturbation bounds (Wedin's Theorem; Stewart & Sun, 1990) we know $Z$ is always $4r^{1.5}\epsilon_S/\alpha$-close. □


Another property is that in the intersection $\cap_{j \in U} S_j$ there is always an element that is not already found.

Lemma C.3. $\|P_{\Gamma^{\perp}} Z\| \ge \alpha/2$ if and only if $\cap_{j \in U} S_j$ contains at least one element outside of $R$. Further, this ensures $S = \cap_{j \in U} S_j$ always contains at least one element outside of $R$ during the inner-loop.

Proof. This is because by the induction hypothesis $\Gamma$ is $\epsilon_\Gamma$-close to the row span of $W^R$. On the other hand by Lemma C.2 we know $Z$ is close to the row span of $W^{\cap_{j \in U} S_j}$. If $\cap_{j \in U} S_j \subset R$ then the row span of $W^{\cap_{j \in U} S_j}$ is a subspace of the row span of $W^R$ and $\|P_{\Gamma^{\perp}} Z\| \le \epsilon_\Gamma + \epsilon_v < \alpha/2$.

On the other hand, if $\cap_{j \in U} S_j$ has an element that is outside $R$, then since $W$ is $\alpha$-robust, there is a direction in the row span of $W^{\cap_{j \in U} S_j}$ with distance at least $\alpha$ to the row span of $W^R$. By the triangle inequality $\|P_{\Gamma^{\perp}} Z\| \ge \alpha - \epsilon_\Gamma - \epsilon_v \ge \alpha/2$.

The last statement of the lemma then follows directly because this is true initially ($S = [r]$ initially) and the conditions in the if-statement ensure this property is preserved. □
□ 


Using these two properties we know whenever the inner-loop adds a point to the list $\hat{W}^i$ then it must be $\epsilon_v$ close to one of the unfound $W^i$'s (which is the first part of the induction hypothesis). Next we prove that if at the end of the inner-loop $\dim(Z)$ is more than 1, then $\cap_{j \in U} S_j$ does not contain any vertices in $P \setminus R$.

Lemma C.4. If $\dim(Z)$ is more than 1 after the inner-loop, then $\cap_{j \in U} S_j$ does not contain any vertices in $P \setminus R$.

Proof. Assume towards contradiction that $\dim(Z) > 1$ and there is an element $i \in P \setminus R$ with $i \in \cap_{j \in U} S_j$. Let $S = \cap_{j \in U} S_j$ after the inner-loop. By assumption and by Lemma C.3 we know $S$ has at least two elements; one of them must be $i$, and call another $i'$. By the property of $P$ we know there exists a set $S_j$ where $i \in S_j$ and $i' \notin S_j$. Clearly $j > p$ (where $p$ is the initial element) as it contains an element outside of $R$, and $S_p$ must contain $i'$ (otherwise $i'$ will not be in $S$).

When the inner-loop goes to $j$, by Lemma C.2 the dimension of $Z'$ will be smaller than that of $Z$. Also, by robustness we know the set at that point contains an element (namely $i$) that is not in $R$, so by Lemma C.3 we know $\|P_{\Gamma^{\perp}} Z'\| \ge \alpha/2$. As a result $j$ must be added to $U$, and this contradicts the fact that in the end $i'$ is still in $S$. □

Let $S = \cap_{j \in U} S_j$ after the inner-loop. Finally we show that in the next iteration $\Gamma$ will be $\epsilon_\Gamma$-close to the span of vertices in $R \cup S$.

Lemma C.5. Let $S = \cap_{j \in U} S_j$ after the inner-loop; then in the next iteration, $\Gamma$ is $\epsilon_\Gamma$ close to the row span of $W^{R \cup S}$.

Proof. Based on the hypothesis all the matrices appended to the matrix $Y$ are $\epsilon_v$-close to the span of a subset of rows of $W$, and the union of all the previous subsets is equal to $R$. Let $B$ be a matrix that corresponds to the matrix $Y$ with the true spans; then $\|B - Y\| \le r\epsilon_v$, and on the other hand the span of $B$ is equal to the row span of $W^{R \cup S}$, with smallest nonzero singular value at least $\alpha$ (because $W$ is $\alpha$-robust). Therefore by Wedin's theorem we know, since $r\epsilon_v \ll \alpha$, that $\Gamma$ must be $\epsilon_\Gamma = 2r\epsilon_v/\alpha$-close to the row span of $W^{R \cup S}$. □

The last two lemmas prove the second half of the induction hypothesis. Finally it is easy to see that the algorithm will not stop as long as $P \setminus R$ is not empty, and it must stop after $r$ iterations because the size of $R$ increases by at least 1 in every iteration. This concludes the proof of Theorem 6.1.


Finding the remaining vertices. The proof of Theorem 6.2 follows directly from Lemma 4.5 in Arora et al. (2013); for completeness we explain the proof here. ($\epsilon_v < \alpha/20r$, $O(\epsilon/\alpha^2)$)

Proof (of Theorem 6.2). First observe that $\alpha$-robust implies $\alpha$-robust in Arora et al. (2013), because for any vertex $W^i$, let $v$ be the direction of $W^i$ projected to the orthogonal subspace of $W^{-i}$ (all the other rows). By the $\alpha$-robust condition of this paper we know $\|Wv\| \ge \alpha$, which in particular implies $W^i \cdot v \ge \alpha$.

By Lemma 4.5 in Arora et al. (2013), as long as the previously found vertices are at least $\alpha/20r$-close, and all the points are $\epsilon$-close, the new vertex found by the algorithm must be $O(\epsilon/\alpha^2)$ close. Since $\epsilon/\alpha^2 \ll \epsilon_v$ we can find all the remaining vertices.

Note that we are not running the clean-up phase of Algorithm 4 FastAnchorWords. This is because the vertices we find in this phase are already more accurate than the intersection vertices, and the clean-up phase cannot improve the quality of the intersection vertices (as they do not appear among the rows of $M$). □

D Generative model for subset-separable NMF

In this section we prove that under a natural generative model an NMF problem can have $(N, H, \gamma)$-properly filled facets with high probability.

In order to prove Theorem 7.1 we use the following two lemmas. The first lemma shows that with enough uniform points in a simplex, with high probability one of them will be a center for the simplex.

Lemma D.1 (Restating Lemma 7.2). Given $n = \Omega((4d)^d \log d/\eta)$ uniform points $v^1, v^2, \ldots, v^n$ in a standard $d$-dimensional simplex (with vertices $e_1, e_2, \ldots, e_d$), with probability $1 - \eta$ there exists a point $v^i$ such that $v^i = \sum_{j \neq i} w_j v^j$ ($w_j \ge 0$, $\sum_{j \neq i} w_j = 1$) and $\sigma_{\min}(\sum_{j \neq i} w_j v^j (v^j)^\top) \ge 1/16d$.


Proof. Consider $d + 1$ subsets of the $d$-dimensional simplex: let $S_0$ be the set of points that satisfy $v_i \ge 1/2d$ for all $i \in [d]$; let $S_j$ ($j \in [d]$) be the set of points that satisfy $v_j \ge 1 - 1/4d$. The volume of each of these sets is at least a $(4d)^{-d}$ fraction of the simplex. By a simple Chernoff bound we know when there are $n = \Omega((4d)^d \log d/\eta)$ samples, with probability at least $1 - \eta$ there is a point in each of these sets.

Next we shall prove the point in $S_0$ is in the convex hull of the points in the $S_j$'s, and the convex hull satisfies the smallest singular value requirement. First we relabel the points: let $v^0$ be any point in $S_0$ and $v^j$ be any point in $S_j$ ($j \neq 0$). Let $V \in \mathbb{R}^{d \times d}$ be the matrix whose columns are the $v^j$'s ($j \in [d]$). We can apply Gershgorin's Disk Theorem to the matrix $V^\top V$ (this is a matrix with diagonal entries at least $1 - 1/2d$ and off-diagonal entries at most $1/2d$), and conclude that $\sigma_{\min}(V^\top V) \ge 1/4$.

Since in particular $V$ is full rank, let $w = V^{-1} v^0$. Let $w_i$ be the smallest entry. If $w_i < 0$ then since $\sum_{j=1}^{d} w_j = 1$ (all the columns of $V$ and $v^0$ sum up to 1), $\sum_{j \neq i} |w_j| \le 1 - 2d w_i$. The $i$-th coordinate satisfies $(Vw)_i = \sum_{j=1}^{d} w_j v^j_i \le (1 - 1/4d) w_i + (1 - 2d w_i)/4d < 1/4d$, which cannot be equal to $v^0_i$, therefore $w_i \ge 0$. In this case since $1/2d \le v^0_i = (Vw)_i = \sum_{j=1}^{d} w_j v^j_i \le w_i + \frac{1}{4d}$, we know $w_i \ge 1/4d$. Therefore $\sigma_{\min}(\sum_{j=1}^{d} w_j (v^j)(v^j)^\top) \ge \frac{1}{4d} \sigma_{\min}(V V^\top) \ge \frac{1}{16d}$. □
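The two conclusions of this proof, $w_i \ge 1/4d$ and $\sigma_{\min} \ge 1/16d$, can be checked on a random instance. The sketch below is illustrative: the dimension and the particular way of drawing points inside $S_0$ and $S_j$ (mixing a vertex or the barycenter with a Dirichlet sample) are assumptions made for the example, not the uniform sampling of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6

# v0 in S_0: every coordinate is at least 1/(2d).
v0 = 0.5 * np.full(d, 1.0 / d) + 0.5 * rng.dirichlet(np.ones(d))

# Column j of V lies in S_j: its j-th coordinate is at least 1 - 1/(4d).
V = np.zeros((d, d))
for j in range(d):
    t = rng.uniform(0.0, 1.0 / (4 * d))
    V[:, j] = (1.0 - t) * np.eye(d)[j] + t * rng.dirichlet(np.ones(d))

w = np.linalg.solve(V, v0)               # weights with v0 = V w
S = (V * w) @ V.T                        # sum_j w_j v^j (v^j)^T = V diag(w) V^T
sigma_min = np.linalg.svd(S, compute_uv=False)[-1]
```

The weights `w` sum to one, each is at least `1 / (4 * d)`, and `sigma_min` is at least `1 / (16 * d)`, matching the statement that `v0` is a center of the simplex spanned by the columns of `V`.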


The next lemma shows that only subspaces that contain a properly filled facet can have many points.

Lemma D.2 (Restating Lemma 7.3). Given $n = \Omega(rd \log(d/(p_{\min}\eta))/p_{\min})$ uniform points $v^1, v^2, \ldots, v^n$ in a standard $d$-dimensional simplex (with vertices $e_1, e_2, \ldots, e_d$), with probability $1 - \eta$, for all matrices $A \in \mathbb{R}^{r \times d}$ whose largest column norm is equal to 1, there are at most $p_{\min} n/4$ points with $\|A v^i\| \le p_{\min}/200d$.


Proof. We first prove this for a particular matrix $A$, then we will construct an $\epsilon$-net and do a union bound over all possible matrices $A$.

Let $u = A_i$ where $A_i$ is the column with norm 1. For random $v$ that is uniform in the standard $d$-dimensional simplex, we will bound the probability that $|u^\top A v|$ is small by $p_{\min}/8$. By a property of the uniform distribution on a simplex, we know $v_i$ is independent of $v_{-i}/(1 - v_i)$ (where $v_{-i}$ is the vector $v$ with the $i$-th coordinate removed), and $v_i$ is distributed as a Beta distribution $\mathrm{Beta}(1, d - 1)$. Let $q = u^\top A_{-i} v_{-i}/(1 - v_i)$ (where $A_{-i}$ is $A$ with the $i$-th column removed); then we know $u^\top A v = v_i + (1 - v_i) q$ and $q \in [-1, 1]$. The density function of $v_i$ is bounded by $d - 1$, therefore for any value $q$, the probability that $|u^\top A v| \le p_{\min}/100d$ is at most $p_{\min}/8$. When the number of samples is at least $n = \Omega(\log(1/\eta')/p_{\min})$, with probability $1 - \eta'$ there are at most $p_{\min} n/4$ points that satisfy $|u^\top A v| \le p_{\min}/100d$.

Now we construct an $\epsilon$-net so that for any matrix $A$ with largest column norm 1, there is a matrix $A'$ in the $\epsilon$-net that is column-wise $\epsilon$-close to $A$. Set $\epsilon = p_{\min}/200d^2$; by a standard construction the number of matrices in the $\epsilon$-net is $O(d^2/p_{\min})^{rd}$. Let $\eta' = \eta/O(d^2/p_{\min})^{rd}$ (and hence $n = \Omega(rd \log(d/(p_{\min}\eta))/p_{\min})$); by a union bound we know with probability $1 - \eta$, there are at most $p_{\min} n/4$ points with $\|A' v^i\| \le p_{\min}/100d$ for all matrices $A'$ in the $\epsilon$-net. For a matrix $A$ that is not in the $\epsilon$-net, let $A'$ be the matrix in the net that is column-wise $\epsilon$-close; clearly $|\|A v^i\| - \|A' v^i\|| \le p_{\min}/200d$. If there are more than $p_{\min} n/4$ points with $\|A v^i\| \le p_{\min}/200d$ then all these points will have $\|A' v^i\| \le p_{\min}/100d$, and that is impossible. □
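The single-matrix step of this proof lends itself to a quick Monte Carlo illustration. In the sketch below the sizes, the value of $p_{\min}$, and the random matrix are arbitrary assumptions; uniform points on the simplex are drawn as Dirichlet(1, ..., 1) samples, whose coordinates have the $\mathrm{Beta}(1, d-1)$ marginal used in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d, p_min, n = 4, 5, 0.2, 100_000

# Random matrix, rescaled so that its largest column norm is exactly 1.
A = rng.standard_normal((r, d))
A /= np.linalg.norm(A, axis=0).max()
i = int(np.argmax(np.linalg.norm(A, axis=0)))
u = A[:, i]                                   # the unit-norm column

V = rng.dirichlet(np.ones(d), size=n)         # rows are uniform on the simplex
proj = V @ (A.T @ u)                          # u^T A v for every sample
frac_small = np.mean(np.abs(proj) <= p_min / (100 * d))
mean_vi = V[:, i].mean()                      # Beta(1, d-1) has mean 1/d
```

Since $\|Av\| \ge |u^\top A v|$ for a unit vector $u$, the empirical fraction `frac_small` also upper-bounds the fraction of samples with small $\|Av\|$ at the same threshold; on this instance it stays well below `p_min / 8`.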


With these two lemmas we can now prove the theorem:

Proof (of Theorem 7.1). In order to satisfy Condition 1, we apply Lemma 7.2. For any proper facet the points are equal to the rows of $W^{S_i}$ multiplied by uniform random points; since $\sigma_{\min}(W^{S_i}) \ge \sigma_{\min}(W) \ge \alpha$, we know if the facet has more than $\Omega((4d)^d \log d/\eta)$ points the convex combination has smallest singular value $\alpha^2/16d$. This is ensured when the number of samples is at least $n = \Omega((4d)^d \log(kd/\eta)/p_{\min})$ by a Chernoff bound.

Condition 2 is satisfied whenever $n = \Omega(\log(k/\eta)/p_{\min})$ by a simple Chernoff bound.

Condition 3 follows from Lemma 7.3. Suppose $\hat{Q}$ is a subspace such that for any proper facet $Q_i$ we have $\|P_{\hat{Q}^{\perp}} Q_i\| > H\epsilon$. Since $W$ is $\alpha$-robust this means $\|P_{\hat{Q}^{\perp}} (W^{S_i})^\top\| > H\alpha\epsilon$, therefore there is always a column that has norm $H\alpha\epsilon/\sqrt{r}$. By Lemma 7.3 we know no matter which subspace the point is chosen from, with probability at most $p_{\min}/8$ it will be $\epsilon/2$-close to the subspace. Now we can apply a union bound to the product of all the $\epsilon$-nets constructed for different proper facets, so the size of the net is $\exp(kdr \log d)$. Therefore we know when $n = \Omega(kr^2 \log(r/(p_{\min}\eta))/p_{\min})$ (here for simplicity we used $d = r$ because in particular the interior points are in a space of dimension $r$), with high probability there will be at most $p_{\min} n/2$ points for this subspace $\hat{Q}$. □